Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for sending error codes on session close #121

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

sukunrt
Copy link
Member

@sukunrt sukunrt commented Aug 26, 2024

This adds support for sending error codes on closing a session.

The error isn't guaranteed to be sent to remote. It depends on the LINGER value for the TCP socket and also on whether there's any unread data in the receive buffer when close was called. In both these situations, the GoAway frame might be dropped.

To reliably send error codes, we'd have to send a TCP FIN packet and wait for remote its half of the connection. This would also require sending everything that's pending in the kernel write buffer. To not introduce this 1RTT delay of closing, I've opted to make this a best effort implementation.

@sukunrt sukunrt force-pushed the sukun/conn-error-2 branch from 9d54c73 to f480b10 Compare August 26, 2024 15:42
@sukunrt sukunrt requested review from MarcoPolo and Stebalien August 26, 2024 15:43
@sukunrt sukunrt force-pushed the sukun/conn-error-2 branch from f480b10 to efa52bc Compare August 26, 2024 15:44
@sukunrt sukunrt changed the title add support for sening error codes on session close add support for sending error codes on session close Aug 26, 2024
@sukunrt sukunrt force-pushed the sukun/conn-error-2 branch 4 times, most recently from af1ac71 to 18a75f1 Compare August 26, 2024 16:35
@Stebalien
Copy link
Member

Is there an issue describing how this will be used? I usually want to send errors when closing a stream, less when closing a connection. Is the plan to use this in the connection manager?

@sukunrt
Copy link
Member Author

sukunrt commented Aug 27, 2024

Apologies! The specs are here: libp2p/specs#623
The corresponding change in go-libp2p. There's only quic support for now. libp2p/go-libp2p#2927

In go-libp2p we will mostly use Connection Close error codes from the connection manager. Applications can define their error codes.

Stream error codes will be introduced in a separate PR.

@sukunrt sukunrt force-pushed the sukun/conn-error-2 branch from 18a75f1 to 4b262c0 Compare August 27, 2024 06:31
@sukunrt sukunrt marked this pull request as ready for review August 27, 2024 10:13
@sukunrt sukunrt force-pushed the sukun/conn-error-2 branch from 4aed194 to 8adb9a8 Compare August 27, 2024 16:09
@Stebalien
Copy link
Member

Oh, I see.

Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it probably works, but blocking on connection close is "new" (as far as I know) so we need to make sure it's not going to cause issues with other parts of the code.

session.go Outdated Show resolved Hide resolved
// The GoAway may not actually be sent depending on the semantics of the underlying net.Conn.
// For TCP connections, it may be dropped depending on LINGER value or if there's unread data in the kernel
// receive buffer.
func (s *Session) CloseWithError(errCode uint32) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we updated the connection manager to be able to deal with potential blocking here? Also, we should probably document it.

const.go Outdated Show resolved Hide resolved
@sukunrt
Copy link
Member Author

sukunrt commented Aug 28, 2024

Blocking on a connection is unfortunate. The "correct" way here would be to send the error code with the RST packet. The latest TCP RFC also recommends this: https://www.rfc-editor.org/rfc/rfc9293.html#name-reset-processing

TCP implementations SHOULD allow a received RST segment to include data (SHLD-2). It has been suggested that a RST segment could contain diagnostic data that explains the cause of the RST. No standard has yet been established for such data.

But no implementation provides this API at the moment.

@sukunrt
Copy link
Member Author

sukunrt commented Aug 28, 2024

but blocking on connection close is "new"

We can consider running this Async in a different goroutine. That'll use more memory and mess up the resource manager accounting for a short duration, but it'll cause less issues with existing code.

The current implementation wouldn't block in most of the cases. If there's receive window available at the remote end, it wont block.

@Stebalien
Copy link
Member

So, we usually only close connections from the connection manager, right? IMO, we should consider:

  1. Being more aggressive on timeouts.
  2. Close in parallel (in the connection manager).

On the other hand.... spawning a goroutine isn't terrible. The resource consumption is fairly minimal in modern go and blocking will tie up resources just the same.

@Stebalien
Copy link
Member

Actually... we already have a goroutine. Can we reuse it? Are we not accounting for that one in the resource manager?

@MarcoPolo
Copy link
Contributor

So, we usually only close connections from the connection manager, right?

A common case will also be the resource manager closing connections after seeing some limit has been reached.

@sukunrt
Copy link
Member Author

sukunrt commented Nov 18, 2024

@Stebalien

Actually... we already have a goroutine. Can we reuse it? Are we not accounting for that one in the resource manager?

We can but that will not account for the resources held up in the background goroutine. The way it works currently from go-libp2p is:

yamuxConn.CloseWithError(1)
connScope.Done()

The alternative, clean up the resources and mark the scope done in a background goroutine, feels worse to me. If you're closing a connection because you're short of resources, this doesn't help at all. If you're not closing the connection because you're short of resources, you can afford to wait, or do the whole thing in a separate goroutine without blocking the routine calling CloseWithError.

@sukunrt sukunrt self-assigned this Nov 18, 2024
@Stebalien
Copy link
Member

The alternative, clean up the resources and mark the scope done in a background goroutine, feels worse to me. If you're closing a connection because you're short of resources, this doesn't help at all.

It does help because it prevents you from using resources you don't have.

@MarcoPolo
Copy link
Contributor

My understanding is that this has the same behavior as before where .Close() is non-blocking. CloseWithError() is blocking, but that's a new API so no worry about breaking existing users. Is that correct?


The alternative, clean up the resources and mark the scope done in a background goroutine, feels worse to me. If you're closing a connection because you're short of resources, this doesn't help at all.

It does help because it prevents you from using resources you don't have.

Right. We should not be telling the resource manager we are done until after we are done. Otherwise we are open ourselves to attacks the resource manager is meant to prevent.

Maybe when we use this in go-libp2p we should set some threshold of conns we are willing to play nice for and send errors, but above that threshold we just hard reset with no error in order as to not stall at reclaiming resources. This should solve the common case of informing well-behaved peers about your error state, as well as avoiding the scenario where a malicious peer hogs up resources by being slow to receive the error.

Copy link
Contributor

@MarcoPolo MarcoPolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good. We should test these on a libp2p node before cutting a release here. But we can do that after we merge if you like.

session.go Outdated
@@ -46,9 +46,9 @@ var nullMemoryManager = &nullMemoryManagerImpl{}
type Session struct {
rtt int64 // to be accessed atomically, in nanoseconds

// remoteGoAway indicates the remote side does
// remoteGoAwayNormal indicates the remote side does
// not want futher connections. Must be first for alignment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What depends on this alignment? Apparently nothing as this has not been first for years.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing. I don't need the field either. I have removed it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the present semantics are, if you call session.GoAway(), existing streams keep working but it prevents you from creating new streams. These two flags remoteGoAway and localGoAway implement this.

So far as I know, we don't use it anywhere in go-libp2p. Since this will be a major version upgrade, we can change the semantics to make GoAway reset all the streams as a session.Close does.

const.go Outdated
// ErrRemoteGoAway is used when we get a go away from the other side
ErrRemoteGoAway = &Error{msg: "remote end is not accepting connections"}
// ErrRemoteGoAwayNormal is used when we get a go away from the other side
ErrRemoteGoAwayNormal = &GoAwayError{Remote: true, ErrorCode: goAwayNormal}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some text to explain what "normal" is here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means a goAway frame was received with error code goAwayNormal. I changed the name back to the original to avoid changing a Public name.

s.exitErr(err)
// Prefer the recvLoop error over the sendLoop error. The receive loop might have the error code
// received in a GoAway frame received just before the TCP RST that closed the sendLoop
s.shutdownLock.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment as to why you are holding the shutdownLock around this section.

Copy link
Member Author

@sukunrt sukunrt Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comment, can you review once more?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That looks better, thanks.

stream.go Outdated Show resolved Hide resolved
session_test.go Outdated Show resolved Hide resolved
session_test.go Outdated Show resolved Hide resolved
@sukunrt sukunrt force-pushed the sukun/conn-error-2 branch 2 times, most recently from af8e895 to b523bdd Compare November 19, 2024 13:49
@sukunrt
Copy link
Member Author

sukunrt commented Nov 19, 2024

@Stebalien

It does help because it prevents you from using resources you don't have.

Sure, but that's not what I'm arguing.

We have two design choices. From a users perspective, who cares about libp2p connections:

  1. Blocking: libp2pConn.CloseWithError blocks and frees up all the resources when it returns
  2. NonBlocking: libp2pConn.CloseWithError doesn't block and frees up resources in the background with correct accounting.

In terms of accounting both the approaches are correct they only mark resources as freed when they're actually freed.

In the Non Blocking implementation, how will a user who closes the connection because it wants to use the resource somewhere else be notified that the resources have been freed? This will require extending more API surface to signal this cleanup completion.

The inverse situation for the blocking approach is how will a user not block when closing a connection with error codes? This is much simpler. You simply close in a goroutine, and if you need to, signal on completion with a channel. One way we can use this is, as @MarcoPolo suggests:

Maybe when we use this in go-libp2p we should set some threshold of conns we are willing to play nice for and send errors, but above that threshold we just hard reset with no error in order as to not stall at reclaiming resources. This should solve the common case of informing well-behaved peers about your error state, as well as avoiding the scenario where a malicious peer hogs up resources by being slow to receive the error.

To be clear, the slowness is for our pending writes and peer has no control over it. We are only blocked by our in progress writes. The blocking for just writing the GoAwayFrame is much lower(100 ms).

@sukunrt
Copy link
Member Author

sukunrt commented Nov 19, 2024

The changes look good. We should test these on a libp2p node before cutting a release here. But we can do that after we merge if you like.

Let's merge this along with libp2p/go-libp2p#2927. This will not be used without libp2p/go-libp2p#2927, and this is not an actively worked on repo so there's not much rebasing work to do.

@sukunrt
Copy link
Member Author

sukunrt commented Nov 19, 2024

My understanding is that this has the same behavior as before where .Close() is non-blocking. CloseWithError() is blocking, but that's a new API so no worry about breaking existing users. Is that correct?

Correct.

@MarcoPolo
Copy link
Contributor

To be clear, the slowness is for our pending writes and peer has no control over it.

The peer controls what has been ack'd. Or am I missing something?

@Stebalien
Copy link
Member

In the Non Blocking implementation, how will a user who closes the connection because it wants to use the resource somewhere else be notified that the resources have been freed? This will require extending more API surface to signal this cleanup completion.

Why would they want this information? Users generally aren't managing connections that way, that's the job of the connection manager. I.e.:

  1. If resources get tight, the connection manager closes connections to free up resources.
  2. When and if resources are available, it'll once again be possible to create new connections.

Ideally with some wiggle room to prevent blocking.

Note: "waiting until the connection closes" isn't actually useful from a resource management perspective anyways because other connections could have been created in the meantime.

@Stebalien
Copy link
Member

The inverse situation for the blocking approach is how will a user not block when closing a connection with error codes? This is much simpler. You simply close in a goroutine, and if you need to, signal on completion with a channel. One way we can use this is, as @MarcoPolo suggests:

I completely agree, we should just close in a goroutine. It's informational anyways.

My point is just that:

  1. Users don't currently expect this to blocks, so I'd like to avoid blocking.
  2. I'd like the resources to be tracked somehow so we don't start allowing unbounded resource allocation.

Your point about setting some form of limit seems reasonable, although I figured we could just use the resource limiter for that.

Really, an alternative solution is to just be best-effort: instead of trying to re-use the goroutine, attempt to (non-blocking) allocate new resources with the resource manager to cover the new goroutine to send the error code. If that fails, just don't send the error code because we're clearly really low on resources and should just walk away ASAP.

@sukunrt
Copy link
Member Author

sukunrt commented Nov 21, 2024

I've implemented something like this:

Really, an alternative solution is to just be best-effort: instead of trying to re-use the goroutine, attempt to (non-blocking) allocate new resources with the resource manager to cover the new goroutine to send the error code. If that fails, just don't send the error code because we're clearly really low on resources and should just walk away ASAP.

In go-libp2p here: libp2p/go-libp2p@3c53bf6#diff-99b3819033614bf0bec2ee2aa8f71c99a9ebc98cbfd887a6242aac406904fe7d See the changes to net/upgrader/conn.go
The corresponding yamux change is here: #123
This changes CloseWithError to CloseWithErrorChan and lets go-libp2p do the resource management.

When not resource constrained, we will send the error code. When we are resource constrained, we will terminate immediately.

@Stebalien
Copy link
Member

Gah. That works but it's kind of nasty.

Taking a fresh look at this... blocking on CloseWithError isn't terrible as long as Close doesn't block. Maybe make it possible to pass a context as well?

I'm also going to bow out of this conversation because I feel that I might be causing more confusion than anything.

@sukunrt
Copy link
Member Author

sukunrt commented Nov 22, 2024

Your comments have been really helpful Steven! Thanks for the help here.

@Stebalien
Copy link
Member

Thanks! I'm just not paying enough attention to this conversation so I don't want to keep coming back once every few months with conflicting opinions, adding confusion to the discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants